Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release dependent modules only after the worker has finished for scheduled modules #39245

Merged
merged 2 commits into from
Sep 23, 2022

Conversation

makortel
Copy link
Contributor

@makortel makortel commented Aug 29, 2022

PR description:

In #39064 (comment) we found that the PuttableProductResolver::putProduct() releasing the task(s) for dependent modules can lead to rare problems if the producing module produces two (or more) products that refer to each other. Specifically, a consuming module may get run when only one of the products has been put in place, that can lead to deference of a Ref to fail on "missing product".

All consuming modules call the prefetchAsync_(), which for PuttableProductResolver already creates a task to release the dependent modules (m_waitingTasks), and inserts that into the Worker's WaitingTaskList (that is released after the Worker finishes). I think we could just rely on that , and therefore this PR removes the override of putProduct() from PuttableProductResolver. This assumption turned out to be wrong, because there is also another use case for PuttableProductResolver: PuttableSourceBase sources (and TestProcessor, where it is also used like for sources).

As a simple way forward to fulfill the requirements of both use cases, this PR makes PuttableProductResolver to directly use the Worker's WaitingTaskList in prefetching, if a Worker has been associated to the PuttableProductResolver. If no Worker has been associated, the PuttableProductResolver assumes it is from a Source (or similar), and the product is available for any consuming module without any further waiting in the prefetching. See also #39245 (comment) for some further discussion.

Resolves #39064 .

PR validation:

The reproducer I mentioned in #39064 (comment) succeeds with this change. Framework unit tests pass.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39245/31880

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

  • FWCore/Framework (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks.
@wddgit this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

enable threading

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@makortel
Copy link
Contributor Author

@Dr15Jones Would you agree?

@Dr15Jones
Copy link
Contributor

I'm not sure that works for the mixing module which doesn't (I think) do any prefetching.

@makortel
Copy link
Contributor Author

Do you mean the SecondaryEventProvider that is being used by PileUp

if (pset.existsAs<std::vector<ParameterSet> >("producers", true)) {
std::vector<ParameterSet> producers = pset.getParameter<std::vector<ParameterSet> >("producers");
provider_ = std::make_unique<SecondaryEventProvider>(producers, *productRegistry_, processConfiguration_);
}
that is being used by BMixingModule
// input, cosmics, beamhalo_plus, beamhalo_minus
std::vector<std::shared_ptr<PileUp>> inputSources_;

?

On a first look SecondaryEventProvider sets up the workers as unscheduled

SecondaryEventProvider::SecondaryEventProvider(std::vector<ParameterSet>& psets,
ProductRegistry& preg,
std::shared_ptr<ProcessConfiguration> processConfiguration)
: exceptionToActionTable_(new ExceptionToActionTable),
workerManager_(std::make_shared<ActivityRegistry>(), *exceptionToActionTable_) {
std::vector<std::string> shouldBeUsedLabels;
std::set<std::string> unscheduledLabels;
const PreallocationConfiguration preallocConfig;
for (auto& pset : psets) {
std::string label = pset.getParameter<std::string>("@module_label");
workerManager_.addToUnscheduledWorkers(
pset, preg, &preallocConfig, processConfiguration, label, unscheduledLabels, shouldBeUsedLabels);
}
if (!unscheduledLabels.empty()) {
preg.setUnscheduledProducts(unscheduledLabels);
}
} // SecondaryEventProvider::SecondaryEventProvider

But maybe there is some hidden detail that deserves deeper look. Unfortunately this won't be caught by any tests as the functionality has been unused for some years (and Mixing/Base has no tests).

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests RelVals RelVals-THREADING
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cb5f3c/27179/summary.html
COMMIT: edbadba
CMSSW: CMSSW_12_6_X_2022-08-29-1100/el8_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39245/27179/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test TestFWCoreFrameworkTrigBit had ERRORS

RelVals

The relvals timed out after 4 hours.

RelVals-THREADING

The relvals timed out after 4 hours.

@makortel
Copy link
Contributor Author

Well, something gets wrong

Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 29-Aug-2022 22:49:57.257 CEST


A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.


---> test TestFWCoreFrameworkTrigBit had ERRORS
Mon Aug 29 23:49:32 CEST 2022

@makortel
Copy link
Contributor Author

In addition, specifically HARVESTING step seems to get stuck (because of starvation of tasks in TBB)

@Dr15Jones
Copy link
Contributor

@makortel maybe the code that transitions from one module on a path to the next module on a path is adding to the module proxy's WaitingTaskList and since it is not a 'consumes' relation your change will fail? That it, it requires the release of the WaitingTaskList on the put if no module 'consumes' that data and it only runs because it is on a Path?

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39245/32217

@cmsbuild
Copy link
Contributor

Pull request #39245 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please check and sign again.

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cb5f3c/27739/summary.html
COMMIT: e8a6e6a
CMSSW: CMSSW_12_6_X_2022-09-22-1100/el8_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39245/27739/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3624368
  • DQMHistoTests: Total failures: 5
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3624341
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor Author

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@makortel
Copy link
Contributor Author

type bug-fix

@perrotta
Copy link
Contributor

+1

  • Fully trusting the gurus who contributed to this PR
  • PR tests pass successfully

@cmsbuild cmsbuild merged commit fbfd2d7 into cms-sw:master Sep 23, 2022
@makortel makortel deleted the puttableProductResolverPutProduct branch September 23, 2022 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crashes in muon HLT reconstruction (reco::TrackExtra product not found)
4 participants